Data “tidying”

You’ve probably heard the idea that data analysis is 80 percent cleaning data or something like this. I think this is generally true, but “cleaning” obscures the wide variation of data manipulation that is required to go from a dataset found in the wild to a dataset that is ready to analyze.

One common issue that you’ll find in datasets that you want to work with is that they aren’t the right “shape” for the analysis you want to perform. In other words, you might be trying to calculate a new variable or make a chart, but the rows and columns in your dataset aren’t quite right. In R and the tidyverse, the shape of the data that is most commonly used is called “tidy data.”

Take for example the following two datasets of California, Oregon, and Washington population between 1990 and 2020:

df_a
## # A tibble: 12 × 3
##    state  year      pop
##    <chr> <dbl>    <dbl>
##  1 CA     1990 29950111
##  2 CA     2000 33987977
##  3 CA     2010 37319550
##  4 CA     2020 39368078
##  5 OR     1990  2858547
##  6 OR     2000  3429708
##  7 OR     2010  3837614
##  8 OR     2020  4241507
##  9 WA     1990  4900780
## 10 WA     2000  5910512
## 11 WA     2010  6743009
## 12 WA     2020  7693612
df_b
## # A tibble: 4 × 4
##    year       CA      OR      WA
##   <dbl>    <dbl>   <dbl>   <dbl>
## 1  1990 29950111 2858547 4900780
## 2  2000 33987977 3429708 5910512
## 3  2010 37319550 3837614 6743009
## 4  2020 39368078 4241507 7693612

Both of these datasets contain the exact same data, but they are stored in a different format. The concept of tidy data describes why one of these datasets is easier to work with and one is harder.

What is tidy data?

  1. Every column is a variable.

  2. Every row is an observation.

  3. Every cell is a single value.

In the example above, df_a is tidy, while df_b is messy. Why?

In df_a we have three variables, state, year, and population, and each of these variables is in it’s own column. In df_a, each row represents an observation for one state in one year. And in df_a, each there is only one value in each cell of the table.

On the other hand, in df_b, each column is a state, but in this case, states are actually values, not variables, and belong in their own column. Additionally, there is no single column that represents the variable we are most interested in, population. In df_b, each row represents a year, but there are multiple observations in each row for each year – one per state. The third requirement of tidy data is not broken by df_b.

So what?? I like my data messy!

{ggplot2} and many other functions from the tidyverse work best when data is in a tidy format. Let’s say we want to make line chart of population over time by state using the example data above. In most line charts we would have year on the x-axis, population on the y-axis, and one line per state. If you remember last week’s lesson on {ggplot2}, in order to do this in {ggplot2} we’ll need to assign these variables to aesthetics. In this case, there are three aesthetics we want map with our data: x, y, and color.

Using df_a, this mapping of columns to aesthetics is straightforward because we have a column for year, a column for population, and a column for state:

df_a |> 
  ggplot(aes(x = year, y = pop, color = state)) +
  geom_line()

But how would we make the same plot using df_b? We can’t because population is spread across three columns, and the states are column names, not values in our data.

df_b |> 
  ggplot(aes(x = year, y = ???, color = ???)) +
  geom_line()

How to tidy data

So, now that you’ve seen an example in which we need to tidy our data in order to plot it using {ggplot2}, how do we go from messy to tidy? Two key functions from the {tidyr} package to transform messy data to tidy data are pivot_longer() and pivot_wider().

Lengthen

If we want to go from df_b to df_a, we’re going to need to “lengthen” our data. This is a common issue when data is stored in column names. As described above, the data that we want to be values in our table are the state abbreviations, which are currently the column names.

df_b
## # A tibble: 4 × 4
##    year       CA      OR      WA
##   <dbl>    <dbl>   <dbl>   <dbl>
## 1  1990 29950111 2858547 4900780
## 2  2000 33987977 3429708 5910512
## 3  2010 37319550 3837614 6743009
## 4  2020 39368078 4241507 7693612
df_b_tidy <- df_b |> 
  pivot_longer(
    cols = c(CA, OR, WA),
    names_to = "state",
    values_to = "pop"
  )

df_b_tidy
## # A tibble: 12 × 3
##     year state      pop
##    <dbl> <chr>    <dbl>
##  1  1990 CA    29950111
##  2  1990 OR     2858547
##  3  1990 WA     4900780
##  4  2000 CA    33987977
##  5  2000 OR     3429708
##  6  2000 WA     5910512
##  7  2010 CA    37319550
##  8  2010 OR     3837614
##  9  2010 WA     6743009
## 10  2020 CA    39368078
## 11  2020 OR     4241507
## 12  2020 WA     7693612

The key arguments to pivot_longer() are:

  • cols – select which columns to pivot into values
  • names_to – name of new column names are stored in
  • values_to – name of new column values are stored in

In order to tidy this data, we’ve created a new column state which contains state abbreviations that were previously stored as column names. We’ve created second new column pop that now contains the population of each state in each year. And now we can use the same ggplot2 code from above and make our line chart.

df_b_tidy |> 
  ggplot(aes(x = year, y = pop, color = state)) +
  geom_line()

Widen

Transforming data from wide to long, is probably the more common operation in data cleaning, but there are instances where you’ll want the data to be wide. This might be to make a table for presentation, or to run statistical models, or sometimes to create a tidy data set you need first lengthen and then widen data.

As a straightforward example to show how pivot_wider() works, we’ll take our newly tidied data, df_b_tidy and turn it make into the messy version we started with.

df_b_tidy |> 
  pivot_wider(
    names_from = "state",
    values_from = "pop"
  )
## # A tibble: 4 × 4
##    year       CA      OR      WA
##   <dbl>    <dbl>   <dbl>   <dbl>
## 1  1990 29950111 2858547 4900780
## 2  2000 33987977 3429708 5910512
## 3  2010 37319550 3837614 6743009
## 4  2020 39368078 4241507 7693612

The key arguments to pivot_wider() are similar to pivot_longer(), but in this case we need to specify in which column the new column names should come from, and which column contains the data or values we want to put in our wider table.

Data types

If you’ve taken a statistics or research methods class, you might have encountered the concept of data types. The idea is that data can be stored in different ways and depending on the type of data, you can conduct different kinds of analyses. The most general data types are quantitative and qualitative. At a high-level, quantitative data are things that can be measured and represented by numbers (e.g. state population) and qualitative data are things that can be measured and represented by text (e.g. state flower). Within both quantitative and qualitative data, there are more specific types and these data types have equivalents when working with data in programming languages. We’re not going to talk about all the data types and when and why you need them in this lesson. The basic concept to understand is that data types represent things in the real world that we measure and store and we should choose the most appropriate computational data type for the real world thing that we are measuring.

Categorical and ordinal data

There are two more specific data types of qualitative data that are relevant to this lesson. Categorical (also called nominal) data is quantitative data that has no inherent order or ranking in it. For example, the possible values of eye color in a DOC data system may be: blue, green, brown, gray, hazel. But there is no way to rank these eye colors from highest to lowest or best to worst. Ordinal data, on the other hand, is data that has an clear ordering of the options. For example, the possible values of a risk assessment in a DOC data system may be: low, medium, high. In this case, it is possible to order these values from least to most risk.

Using appropriate data types is essential for a number of statistical processes, but in this lesson, we care about what data types can do for us when plotting data using {ggplot2}. The biggest reason we care about data types, and specifically categorical and ordinal data is that they dictate the order of groups and categories in plots.

Order matters

Let’s say we have data on average sentence length of people by assessed risk level and we’d like to plot this.

risk_sentence
## # A tibble: 4 × 2
##   risk_level mean_sentence_length
##   <chr>                     <dbl>
## 1 Low                          12
## 2 Medium                       18
## 3 High                         30
## 4 Very high                    60

An appropriate way to visualize this data is a bar chart, where risk level is on the x-axis and sentence length is on the y-axis. Let’s check it out!

risk_sentence |> 
  ggplot(aes(x = risk_level, y = mean_sentence_length)) +
  geom_col() + # geom_col is equivalent to geom_bar(stat = "identity") from the last lesson
  labs(
    title = "Average sentence length by risk level",
    x = "Risk level",
    y = "Mean sentence length (months)"
  )

There is nothing incorrect about this plot, but it is confusing to interpret because the risk levels are not in a sensible order (by default they are ordered alphabetically). A better plot would have the risk levels on the x-axis in order from lowest to highest.

To achieve this, we’ll have to use what’s called in R, a factor variable. This is a special kind of qualitative data that has a set number of possible categories and an in this case a defined order. There is a package that is part of the tidyverse called {forcats} that has a lot of useful functions to make it easier to work with factors, so we’re going to make use of that. If you’ve already attached the tidyverse, using library(tidyverse), {factors} will be ready to use.

Let’s use mutate() from {dplyr} to modify the risk_level column and change it into an ordered factor (from low to very high). Within mutate(), we use fct_relevel() to define the levels and orders of the new factor we create.

risk_sentence_ordered <- risk_sentence |> 
  mutate(risk_level = fct_relevel(risk_level, "Low", "Medium", "High", "Very high"))

risk_sentence_ordered$risk_level
## [1] Low       Medium    High      Very high
## Levels: Low Medium High Very high

Now, if we make this same plot as above it our x-axis is ordered how we want it!

risk_sentence_ordered |> 
  ggplot(aes(x = risk_level, y = mean_sentence_length)) +
  geom_col() +
  labs(
    title = "Average sentence length by risk level",
    x = "Risk level",
    y = "Mean sentence length (months)"
  )

Note that we use fct_relevel() to manually define the order of the values or levels in the factor variable that we want to create. In this next example, we’ll use a different function from {forcats} to create a factor variable that is ordered by the value of another column.

race_sentence <- tribble(
  ~race, ~mean_sentence_length,
  "Black", 12,
  "Asian", 18,
  "White", 30,
  "Latino", 60
)

Here, we have a similar dataset as above, but this time we have average sentence length by the race of the person sentenced. We have very similar ggplot2 code, except we change the dataset we are using and change the variable mapped to the x-axis aesthetic from risk_level to race.

race_sentence |> 
  ggplot(aes(x = race, y = mean_sentence_length)) +
  geom_col() +
  labs(
    title = "Average sentence length by race",
    x = "Race",
    y = "Mean sentence length (months)"
  )

Again, there is nothing “wrong” about this plot, but as with the risk level plot, the x-axis is ordered alphabetically by default. To make this easier to interpret, we might want to order the x-axis by average sentence length so it’s easier for some to quickly see a pattern. We’re going to use fct_reorder() to accomplish this.

race_sentence_ordered <- race_sentence |> 
  mutate(race = fct_reorder(race, mean_sentence_length))

race_sentence_ordered$race
## [1] Black  Asian  White  Latino
## Levels: Black Asian White Latino

The key arguments to fct_reorder() are .f and .x where .f (the first argument) is a character or factor column you want to modify and .x is column you want to use to sort or order by. In this case, we are saying that we want to modify race to become an ordered factor variable that is sorted by mean_sentence_length.

So looking at the plot again, we get:

race_sentence_ordered |> 
  ggplot(aes(x = race, y = mean_sentence_length)) +
  geom_col() +
  labs(
    title = "Average sentence length by race",
    x = "Race",
    y = "Mean sentence length (months)"
  )

Now instead of the x-axis being in alphabetical order, it is ordered by sentence length (low to high). Depending on what we are trying to convey in this plot, we might want to reverse the order so that x-axis is ordered from high to low. To do this, set the .desc argument in fct_reorder() to TRUE so that race is now ordered in descending order based on mean_sentence_length.

race_sentence_ordered |> 
  mutate(race = fct_reorder(race, mean_sentence_length, .desc = TRUE)) |> 
  ggplot(aes(x = race, y = mean_sentence_length)) +
  geom_col() +
  labs(
    title = "Average sentence length by race",
    x = "Race",
    y = "Mean sentence length (months)"
  )

Another common use case for factors in ggplot2 is adjusting the order of legend items. Returning to our plot from the tidy data section of state population, notice that by default, the legend is ordered alphabetically (CA, OR, WA).

df_a |> 
  ggplot(aes(x = year, y = pop, color = state)) +
  geom_line()

This is another case, where this plot is not incorrect, but could be improved if we adjust the items in the legend so that they are in order of the lines in our plot. Factors to the rescue!

We’ll use the same concept as ordering the race column above, but this time reorder the state column to be sorted (in descending order) by pop.

df_a |> 
  mutate(state = fct_reorder(state, pop, .desc = TRUE)) |> 
  ggplot(aes(x = year, y = pop, color = state)) +
  geom_line()

Subtle difference, but now this plot is easier to read because the legend items are in the same order as the lines.

{forcats} has many other functions that help you create and modify factors, but, but for the purposes of plotting, fct_relevel() and fct_reorder() are two of the most useful!

Advanced tidying

We covered the basic transformations needed to convert data from wide to long (and back again), but the data you find in the wild will likely be even messier than the example data we used above. For this next section we’ll use a dataset published by the Vera Institute of jail population by county. You can read in the CSV file from Vera’s GitHub like this:

url <- "https://github.com/vera-institute/incarceration-trends/raw/master/incarceration_trends.csv"
vera_incarceration <- read_csv(url)

This is quite a dataset, so for now we’ll work with one state (Arizona) and a subset of columns in the data set.

az_jail_county <- vera_incarceration |> 
  filter(state == "AZ") |> 
  select(
    year,
    state,
    county_name,
    urbanicity,
    total_pop_15to64,
    female_pop_15to64,
    male_pop_15to64,
    total_jail_pop,
    male_jail_pop,
    female_jail_pop
  )

When I’m working with a new dataset, I almost always start with some basic exploration to see what we’ve got.

az_jail_county |> 
  glimpse()
## Rows: 735
## Columns: 10
## $ year              <dbl> 1970, 1971, 1972, 1973, 1974, 1975, 1976, 1977, 1978…
## $ state             <chr> "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ", "AZ"…
## $ county_name       <chr> "Apache County", "Apache County", "Apache County", "…
## $ urbanicity        <chr> "rural", "rural", "rural", "rural", "rural", "rural"…
## $ total_pop_15to64  <dbl> 17173, 18752, 19818, 21418, 22079, 23811, 24868, 275…
## $ female_pop_15to64 <dbl> 8886, 9698, 10248, 11086, 11439, 12357, 12909, 14299…
## $ male_pop_15to64   <dbl> 8287, 9054, 9570, 10332, 10640, 11454, 11959, 13250,…
## $ total_jail_pop    <dbl> 7.00, 7.50, 8.00, 8.50, 9.00, 9.50, 10.00, 10.50, 11…
## $ male_jail_pop     <dbl> 4.00, 4.75, 5.50, 6.25, 7.00, 7.75, 8.50, 9.25, 10.0…
## $ female_jail_pop   <dbl> 0.00, 0.12, 0.25, 0.38, 0.50, 0.62, 0.75, 0.88, 1.00…

What years does this data cover?

range(az_jail_county$year)
## [1] 1970 2018

What counties are included and how many rows per county?

az_jail_county |> 
  count(county_name)
## # A tibble: 15 × 2
##    county_name           n
##    <chr>             <int>
##  1 Apache County        49
##  2 Cochise County       49
##  3 Coconino County      49
##  4 Gila County          49
##  5 Graham County        49
##  6 Greenlee County      49
##  7 La Paz County        49
##  8 Maricopa County      49
##  9 Mohave County        49
## 10 Navajo County        49
## 11 Pima County          49
## 12 Pinal County         49
## 13 Santa Cruz County    49
## 14 Yavapai County       49
## 15 Yuma County          49

What is urbanicity?!

az_jail_county |> 
  count(urbanicity)
## # A tibble: 4 × 2
##   urbanicity     n
##   <chr>      <int>
## 1 rural        343
## 2 small/mid    294
## 3 suburban      49
## 4 urban         49

Cool! We have a basic handle on this data and one of the first things I noticed is that we’ve got un-tidy data. We have data for total population, and jail population, but this data is also broken down by male and female. And each of these sub-populations has it’s own column. If for example we wanted to make a line chart of jail population over time by sex, we couldn’t do it with data in this format.

az_jail_county |> 
  ggplot(aes(x = year, y = ???, color = ???))

So, as before, we’ve got to tidy this data up! As you can see this data set is a little more complicated than our example dataset above, so we’ll need to do a few more transformations before we can this data into a tidy format we can plot and analyze.

The first step is to pivot_longer() so that all our values become a single population column that we’ll call n.

az_jail_county |> 
  pivot_longer(
    cols = -c(year, state, county_name, urbanicity),
    names_to = "var",
    values_to = "n"
    )
## # A tibble: 4,410 × 6
##     year state county_name   urbanicity var                     n
##    <dbl> <chr> <chr>         <chr>      <chr>               <dbl>
##  1  1970 AZ    Apache County rural      total_pop_15to64  17173  
##  2  1970 AZ    Apache County rural      female_pop_15to64  8886  
##  3  1970 AZ    Apache County rural      male_pop_15to64    8287  
##  4  1970 AZ    Apache County rural      total_jail_pop        7  
##  5  1970 AZ    Apache County rural      male_jail_pop         4  
##  6  1970 AZ    Apache County rural      female_jail_pop       0  
##  7  1971 AZ    Apache County rural      total_pop_15to64  18752  
##  8  1971 AZ    Apache County rural      female_pop_15to64  9698  
##  9  1971 AZ    Apache County rural      male_pop_15to64    9054  
## 10  1971 AZ    Apache County rural      total_jail_pop        7.5
## # ℹ 4,400 more rows

Notice that instead of specifying which columns I want to pivot in the cols argument, in this case I specified which columns I did not want to pivot using -c() (i.e. all columns except the ones I named).

This is a good first step, but that the values of our new var column show that we have data for total population, female population, and male population as well as data for total jail population, female jail population, and male jail population. This is actually two separate variables or concepts in one column: demographic group (total, male, female) and population type (total county and jail). We can use another function from {tidyr} to split this column into it’s two parts, separate_wider_delim(). With this function we can split a single column into two (or more) new columns based on a delimiter we specify that identifies how to split the data.

az_jail_county |> 
  pivot_longer(
    cols = -c(year, state, county_name, urbanicity),
    names_to = "var",
    values_to = "n"
    ) |> 
  separate_wider_delim(
    cols = var,
    delim = "_",
    names = c("dem_group", "pop_type"),
    too_many = "merge"
    )
## # A tibble: 4,410 × 7
##     year state county_name   urbanicity dem_group pop_type         n
##    <dbl> <chr> <chr>         <chr>      <chr>     <chr>        <dbl>
##  1  1970 AZ    Apache County rural      total     pop_15to64 17173  
##  2  1970 AZ    Apache County rural      female    pop_15to64  8886  
##  3  1970 AZ    Apache County rural      male      pop_15to64  8287  
##  4  1970 AZ    Apache County rural      total     jail_pop       7  
##  5  1970 AZ    Apache County rural      male      jail_pop       4  
##  6  1970 AZ    Apache County rural      female    jail_pop       0  
##  7  1971 AZ    Apache County rural      total     pop_15to64 18752  
##  8  1971 AZ    Apache County rural      female    pop_15to64  9698  
##  9  1971 AZ    Apache County rural      male      pop_15to64  9054  
## 10  1971 AZ    Apache County rural      total     jail_pop       7.5
## # ℹ 4,400 more rows

The key arguments here are

  • cols: which column to split
  • delim: what character delimiter to use to split on
  • names: names of new columns to create
  • too_many: what to do if there is “extra” stuff in a split column

And so you can see that we now have two columns called dem_group and pop_type in place of the var column. While we’re at it, let’s also clean up the pop_type column so it’s a little more clear what this column represents.

az_jail_county |> 
  pivot_longer(
    cols = -c(year, state, county_name, urbanicity),
    names_to = "var",
    values_to = "n"
    ) |> 
  separate_wider_delim(
    cols = var,
    delim = "_",
    names = c("dem_group", "pop_type"),
    too_many = "merge"
    ) |> 
  mutate(
    pop_type = if_else(pop_type == "pop_15to64", "pop_total", "pop_jail")
    )
## # A tibble: 4,410 × 7
##     year state county_name   urbanicity dem_group pop_type        n
##    <dbl> <chr> <chr>         <chr>      <chr>     <chr>       <dbl>
##  1  1970 AZ    Apache County rural      total     pop_total 17173  
##  2  1970 AZ    Apache County rural      female    pop_total  8886  
##  3  1970 AZ    Apache County rural      male      pop_total  8287  
##  4  1970 AZ    Apache County rural      total     pop_jail      7  
##  5  1970 AZ    Apache County rural      male      pop_jail      4  
##  6  1970 AZ    Apache County rural      female    pop_jail      0  
##  7  1971 AZ    Apache County rural      total     pop_total 18752  
##  8  1971 AZ    Apache County rural      female    pop_total  9698  
##  9  1971 AZ    Apache County rural      male      pop_total  9054  
## 10  1971 AZ    Apache County rural      total     pop_jail      7.5
## # ℹ 4,400 more rows

And finally, the last step here to make a tidy data set is to pivot once again so that for each row we have one column that is the total population and one column that is jail population. While we’re at it, we might as well create one more new variable since we have jail population and total population: incarceration rate. Since we have this nice tidy dataset, that will just need some simple division.

az_jail_county_tidy <- az_jail_county |> 
  pivot_longer(
    cols = -c(year, state, county_name, urbanicity),
    names_to = "var",
    values_to = "n"
    ) |> 
  separate_wider_delim(
    cols = var,
    delim = "_",
    names = c("dem_group", "pop_type"),
    too_many = "merge"
    ) |> 
  mutate(
    pop_type = if_else(pop_type == "pop_15to64", "pop_total", "pop_jail")
    ) |> 
  pivot_wider(
    names_from = "pop_type",
    values_from = "n"
    ) |> 
  mutate(rate = pop_jail / pop_total)

az_jail_county_tidy
## # A tibble: 2,205 × 8
##     year state county_name   urbanicity dem_group pop_total pop_jail      rate
##    <dbl> <chr> <chr>         <chr>      <chr>         <dbl>    <dbl>     <dbl>
##  1  1970 AZ    Apache County rural      total         17173     7    0.000408 
##  2  1970 AZ    Apache County rural      female         8886     0    0        
##  3  1970 AZ    Apache County rural      male           8287     4    0.000483 
##  4  1971 AZ    Apache County rural      total         18752     7.5  0.000400 
##  5  1971 AZ    Apache County rural      female         9698     0.12 0.0000124
##  6  1971 AZ    Apache County rural      male           9054     4.75 0.000525 
##  7  1972 AZ    Apache County rural      total         19818     8    0.000404 
##  8  1972 AZ    Apache County rural      female        10248     0.25 0.0000244
##  9  1972 AZ    Apache County rural      male           9570     5.5  0.000575 
## 10  1973 AZ    Apache County rural      total         21418     8.5  0.000397 
## # ℹ 2,195 more rows

And here we have it! A tidy data set that we lengthened, separated, and widened which we’re going to use to explore some additional features of {ggplot2}.

Making better ggplots

Last lesson introduced making static plots in R using {ggplot2}. Today, we’re going to expand on that introduction and demonstrate some additional features of {ggplot2}. We are still just scratching the surface of what is possible using {ggplot2}, but hopefully these concepts will get you excited to explore more features and make beautiful plots.

Let’s start by applying some of the things we learned last lesson and earlier in this lesson to a plot of jail incarceration rate by county from our tidy Arizona dataset. We’ll start with a line plot of all the counties and see what we can see.

az_jail_county_tidy |> 
  filter(dem_group == "total") |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line()

The first thing that jumps out at me is that something odd happened in one of the counties starting in 1999 that appears to be a data error. It’s a hard to tell which county this is because we have too many lines and colors in our plot. But this is a great example of why plotting your data is so important. We may not have realized there was something odd going on with this county if we didn’t see it. (We’ll make a note to investigate further what is going on with this county later.)

Because there are too many overlapping lines, we can’t really see much here, so a better plot might focus on a subset of counties. For instance, we can look at the five largest counties in the state by population and make the same plot.

big_counties <- c("Maricopa County", "Pima County", "Pinal County",
                  "Yavapai County", "Yuma County")

az_jail_county_tidy |> 
  filter(
    dem_group == "total",
    county_name %in% big_counties
    ) |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line()

Okay! We can start to see something now. First, jail incarceration rates have gone up in all counties since the 1980s, the peak was sometime around 2010, rates have fallen since then. Looks like Yavapai County had the highest rate and Pinal County had the lowest rate in the last year of data we have available.

Labels

One easy way to improve this plot is by adding a title, subtitle, and axis labels that tells us what we’re looking at

az_jail_county_tidy |> 
  filter(
    dem_group == "total",
    county_name %in% big_counties
    ) |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Jail incarceration rate",
    color = NULL
  )

Note that I’ve excluded x-axis and legend titles by using NULL in labs() because they are self-explanatory.

This is better, but we still have a couple things we could improve. First it is a little tricky to know which line is which county because the legend is not in the same order as the lines in the chart. Additionally, the y-axis labels are hard to interpret. To fix these these things we’ll convert county_name to a factor that is ordered by rate and change the y-axis labels to rate per 100,000 people.

library(scales)

az_jail_county_tidy |> 
  filter(
    dem_group == "total",
    county_name %in% big_counties
    ) |> 
  mutate(county_name = fct_reorder(county_name, rate, last, .desc = TRUE)) |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    )

Direct labeling

This is improving, but we can do better! One technique that is very useful is direct labeling, or adding text labels to the plot itself. In this case, we could add the county name to each line and remove the legend entirely! When using this technique, I find it easiest to assign the cleaned up and filtered data set I want to plot to a new data frame before I plot.

One good way to add text labels to a plot is with geom_text(). In the same way that we use geom_line() for a line plot or geom_point() for a scatterplot, geom_text() will plot text at specified points on a grid. For geom_line() to work, we need map text from a column in our input data to the label aesthetic.

big_county_to_plot <- az_jail_county_tidy |> 
  filter(
    dem_group == "total",
    county_name %in% big_counties
    ) |> 
  mutate(county_name = fct_reorder(county_name, rate, last, .desc = TRUE))

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    aes(label = county_name),
    size = 3,
    hjust = 0,
    nudge_x = 0.25
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    )

Errrrrr. That doesn’t look great. For our direct label plan to work, we don’t want one text label per point, we just want one per line. And I think it will look best if we put the label at the far right of each line. To do this, we can tell geom_text() to use a filtered version of the main data set that only includes the most recent year of data.

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    data = filter(big_county_to_plot, year == 2018),
    aes(label = county_name),
    hjust = 0
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    )

Alright, we’re getting somewhere. But we can’t really read the labels and we still have the legend that we wanted to get rid of. To be able to read the labels we’ve added to the plot we need to extend the area of plot that is visible to us. We do that by turning “clipping” off in our coordinate system and by adding a margin to whole plot. By default, ggplot2 will cut off anything outside the plotting area, but by setting clip = "off" we’ll be able to see the entire county name we are plotting. Additionally, we’ll turn the legend off and increase the right side margin of the plot to 75 pixels.

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    data = filter(big_county_to_plot, year == 2018),
    aes(label = county_name),
    hjust = 0
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  theme(
    legend.position = "none",
    plot.margin = margin(5, 75, 5, 5)
  )

It’s not perfect, but I like this a lot better than where we started!

Themes

Now that we’ve added our direct labels, things look a little weird with the gray background of the plot. We can control the plot background, and a great many other things such as font face and size, position of labels, grid lines, etc. by modifying the ggplot theme. We already started working on the theme above when we got rid of the legend and increased the margin around the plot. There are huge number of arguments to theme() that you can explore and play with. And there is also a way to set a new theme for the entire plot. There are a handful of themes built into ggplot2 that change many aspects of a theme. Here for instance is theme_minimal():

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    data = filter(big_county_to_plot, year == 2018),
    aes(label = county_name),
    hjust = 0
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.margin = margin(5, 75, 5, 5)
  )

And here is theme_classic():

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    data = filter(big_county_to_plot, year == 2018),
    aes(label = county_name),
    hjust = 0
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  theme_classic() +
  theme(
    legend.position = "none",
    plot.margin = margin(5, 75, 5, 5)
  )

I tend to use theme_minimal() as a starting place and build from there. In this case, I think I’d like to get rid of the x-axis grid lines and make the y-axis title and captions a little smaller.

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    data = filter(big_county_to_plot, year == 2018),
    aes(label = county_name),
    hjust = 0
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.margin = margin(5, 75, 5, 5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.title.y = element_text(size = 10),
    plot.caption = element_text(size = 8)
  )

Color palettes

While I’m working on the look of this, I think we might be able to use some better colors for our counties. If I don’t have a set color palette for a particular project I am working on, my first stop is Color Brewer, which has a number of very good palettes. Importantly, you can choose between sequential, diverging, and qualitative (or categorical) palettes depending on what kind of data you are plotting. In this case, we have qualitative data because there is no inherent ordering to the county names. The brewer palettes are built into {ggplot2} and we can access them using scale_color_brewer().

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line() +
  geom_text(
    data = filter(big_county_to_plot, year == 2018),
    aes(label = county_name),
    hjust = 0
  ) +
  scale_y_continuous(labels = label_comma(scale = 1e5)) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Jail incarceration rate by county",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = "Rate per 100k residents",
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  theme_minimal() +
  theme(
    legend.position = "none",
    plot.margin = margin(5, 75, 5, 5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    axis.title.y = element_text(size = 10),
    plot.caption = element_text(size = 8)
  )

Alright! I’m almost very happy with this plot, and it certainly is a big improvement over where we started:

I’m going to make a few more little tweaks and get this to the point where I’d feel good about putting this chart in slide deck or report. See if you can spot the additional changes I’ve made.

Bringing it all together

big_county_to_plot <- az_jail_county_tidy |> 
  filter(
    dem_group == "total",
    county_name %in% big_counties
    ) |> 
  mutate(
    county_name = fct_reorder(county_name, rate, last, .desc = TRUE),
    text_label = paste(county_name, "–", comma(rate, 1, scale = 1e5))
    )

big_county_to_plot |> 
  ggplot(aes(year, rate, color = county_name)) +
  geom_line(size = 0.75) +
  geom_point(
    data = filter(big_county_to_plot, year == max(year)),
    size = 1.25
    ) +
  geom_text(
    data = filter(big_county_to_plot, year == max(year)),
    aes(label = text_label),
    size = 3.5,
    family = "Franklin Gothic Medium",
    hjust = 0,
    nudge_x = 0.5
    ) +
  scale_x_continuous(
   breaks = c(1970, 1980, 1990, 2000, 2010, 2018)
    ) +
  scale_y_continuous(
    labels = label_comma(scale = 1e5),
    expand = expansion(mult = c(0, 0.05))
    ) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Jail populations grew from the 1970s to a peak near 2010",
    subtitle = "Incarceration rate per 100k residents, select Arizona counties",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = NULL,
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  expand_limits(y = 0) +
  theme_minimal(base_family = "Franklin Gothic Book") +
  theme(
    legend.position = "none",
    plot.margin = margin(5, 85, 5, 5),
    panel.grid.major.x = element_blank(),
    panel.grid.minor.x = element_blank(),
    plot.caption = element_text(size = 8),
    plot.title = element_text(family = "Franklin Gothic Medium"),
    axis.ticks.x = element_line(color = "gray80"),
    axis.ticks.length = unit(1.5, "mm"),
    axis.text = element_text(size = 10)
  )

Facets

One additional neat feature of {ggplot2} is facets. These plots are sometimes called small multiples and can be used to show multiple versions of the same plot with different data. It can be a useful technique when you have too much data to include in a single plot.

In this case, we could revisit our first plot where there were too many counties to include all of them. Instead, we could create a faceted plot where we make one plot per county.

To do this, use facet_wrap() and specify which column in your data you want use to create the facets

az_jail_county_tidy |> 
  filter(dem_group == "total") |> 
  ggplot(aes(year, rate)) +
  geom_line() +
  facet_wrap(vars(county_name))

Now it’s easy to figure out the county with the big jump in 1999 (La Paz). By default, ggplot will use the same x and y scales for each facet. Sometimes this is useful if you want to compare the total in each facet with the other facets. Other times, you may be more interested in the relative change or trend across the facets. In this case, you can set the scales to be “free” so that the limits of each facet is set individually. Here, for example we can make the y-axis have free scales – notice the y-axis scales for each facet are now different.

az_jail_county_tidy |> 
  filter(dem_group == "total") |> 
  ggplot(aes(year, rate)) +
  geom_line() +
  facet_wrap(vars(county_name), scales = "free_y")

We can also do all our other fancy theming, labeling, and other changes from above, so that we might end up with a final plot like this:

az_jail_county_tidy |>
  filter(dem_group == "total", county_name != "La Paz County") |> 
  mutate(county_name = str_remove(county_name, " County")) |> 
  ggplot(aes(year, rate)) +
  geom_line(color = "steelblue") +
  scale_x_continuous(
   breaks = c(1970, 1995, 2018)
    ) +
  scale_y_continuous(
    labels = label_comma(scale = 1e5),
    expand = expansion(mult = c(0, 0.05))
    ) +
  scale_color_brewer(palette = "Set1") +
  labs(
    title = "Jail populations grew from the 1970s to a peak near 2010",
    subtitle = "Incarceration rate per 100k residents, Arizona counties",
    caption = "Vera Institute Incarceration Trends",
    x = NULL,
    y = NULL,
    color = NULL
    ) +
  coord_cartesian(clip = "off") +
  expand_limits(y = 0) +
  facet_wrap(vars(county_name)) +
  theme_minimal(base_family = "Franklin Gothic Book") +
  theme(
    legend.position = "none",
    panel.grid.major.x = element_blank(),
    panel.grid.minor = element_blank(),
    plot.caption = element_text(size = 8),
    plot.title = element_text(family = "Franklin Gothic Medium"),
    axis.ticks.x = element_line(color = "gray80"),
    axis.text = element_text(size = 7)
  )

With this kind of plot we get some different insights and begin to understand that the steep increase in jail incarceration rates was not evenly distributed across all Arizona counties.